Info

In this file, I will perform exploratory data analyses (EDA) and subsequent data wrangling to get the data in proper shape for downstream modeling.

What is EDA?

From R4DS, chapter 7: Exploratory Data Analysis:

EDA is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualising, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

Load libraries

First, we load necessary libraries.

library(tidyverse) # A whole bunch of packages necessary for data analysis
library(broom) # To effectively clean up various objects
library(ggforce) # Has some nice ggplot functionality

Data

Load data from the previous step.

variables <- readRDS(file = "../data/processed/variables.rds")
annotation <- readRDS(file = "../data/processed/annotation.rds")
data_comb <- readRDS(file = "../data/processed/data_comb.rds")

Clinical data

Table 1

tableone

Use tableone to get a quick feel for the study population.

tableone::CreateTableOne(
  data = data_comb %>% filter(time == "base"), 
  vars = variables$clinic, 
  strata = "group", 
  factorVars = variables$clinic_fac
)
##                              Stratified by group
##                               control         intervention    p      test
##   n                               52              47                     
##   age (mean (SD))              55.19 (9.79)    53.62 (9.68)    0.424     
##   gender = man (%)                21 (40.4)       20 (42.6)    0.988     
##   hypertensive_med = yes (%)       2 ( 3.8)        3 ( 6.4)    0.908     
##   myocardialinfarction (%)                                     0.668     
##      no                           39 (76.5)       36 (80.0)              
##      yes                           9 (17.6)        8 (17.8)              
##      unknown                       3 ( 5.9)        1 ( 2.2)              
##   smoking (%)                                                  0.165     
##      regularly                     8 (15.7)        2 ( 4.3)              
##      seldom                        0 ( 0.0)        2 ( 4.3)              
##      quit                         11 (21.6)       12 (25.5)              
##      never                        31 (60.8)       31 (66.0)              
##      snuff                         1 ( 2.0)        0 ( 0.0)              
##   smoking2 = yes (%)               9 (17.6)        4 ( 8.5)    0.301     
##   height (mean (SD))          172.95 (9.12)   172.06 (9.16)    0.630     
##   weight (mean (SD))           74.05 (13.14)   75.69 (12.66)   0.530     
##   bmi (mean (SD))              24.63 (3.05)    25.44 (2.92)    0.180     
##   waist (mean (SD))            88.61 (10.56)   90.53 (10.45)   0.365     
##   hip (mean (SD))             102.00 (6.28)   102.21 (5.44)    0.859     
##   whratio (mean (SD))           0.87 (0.07)     0.88 (0.08)    0.293     
##   sbp (mean (SD))             127.56 (14.24)  128.23 (11.86)   0.799     
##   dbp (mean (SD))              76.17 (9.18)    78.36 (8.10)    0.214     
##   pulse (mean (SD))            63.52 (9.36)    64.45 (7.13)    0.583     
##   totalfatpercent (mean (SD))  29.23 (8.33)    30.81 (7.73)    0.329     
##   totalfatweight (mean (SD))   21.63 (7.46)    23.10 (6.43)    0.297     
##   fatfreeweight (mean (SD))    52.43 (11.33)   52.56 (11.72)   0.958     
##   leuko (mean (SD))             5.42 (1.40)     5.29 (1.35)    0.620     
##   neutro (mean (SD))            2.88 (0.87)     2.84 (1.10)    0.879     
##   lymph (mean (SD))             1.89 (0.56)     1.80 (0.51)    0.385     
##   mono (mean (SD))              0.46 (0.14)     0.45 (0.12)    0.683     
##   eosino (mean (SD))            0.15 (0.12)     0.17 (0.12)    0.451     
##   creatinine (mean (SD))       75.48 (12.08)   74.26 (12.40)   0.620     
##   asat (mean (SD))             23.65 (10.06)   21.89 (7.82)    0.337     
##   alat (mean (SD))             27.17 (11.93)   27.17 (12.10)   0.999     
##   ggt (mean (SD))              24.53 (29.20)   27.49 (22.10)   0.587     
##   alp (mean (SD))              69.38 (21.15)   74.28 (16.73)   0.208     
##   hba1c (mean (SD))             5.36 (0.23)     5.29 (0.25)    0.175     
##   glucose (mean (SD))           5.23 (0.37)     5.29 (0.39)    0.458     
##   insulin (mean (SD))          50.62 (28.18)   58.94 (27.54)   0.141     
##   tg (mean (SD))                1.31 (0.60)     1.27 (0.49)    0.758     
##   tc (mean (SD))                6.58 (0.82)     6.55 (0.82)    0.886     
##   hdlc (mean (SD))              1.72 (0.43)     1.70 (0.45)    0.809     
##   ldlc (mean (SD))              4.12 (0.64)     4.18 (0.64)    0.678     
##   ldlhdlratio (mean (SD))       2.55 (0.74)     2.64 (0.79)    0.554     
##   apoa1 (mean (SD))             1.62 (0.20)     1.59 (0.22)    0.510     
##   apob (mean (SD))              1.31 (0.20)     1.30 (0.17)    0.724     
##   aporatio (mean (SD))          0.82 (0.14)     0.83 (0.15)    0.811     
##   lpa (mean (SD))             296.29 (283.24) 285.86 (319.50)  0.869     
##   crp (mean (SD))               1.45 (1.26)     1.57 (1.57)    0.689     
##   tsh (mean (SD))               2.03 (0.78)     1.84 (0.81)    0.238     
##   vitd (mean (SD))             74.50 (18.80)   73.64 (21.22)   0.831     
##   vitd3 (mean (SD))            74.50 (18.80)   69.36 (20.63)   0.198

The CreateTableOne function is part of the tableone package and has a lot of useful functionality. To learn more, check out the tableone vignette! (A vignette is a short tutorial that gives you a feel for how to use a specific package and its functions; when well written, vignettes are often very useful.)
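You can find vignettes from within R itself. A quick sketch (the exact vignette name may differ between package versions):

```r
# List the vignettes that ship with tableone
vignette(package = "tableone")

# Open a specific vignette by name
vignette("introduction", package = "tableone")

# Or browse all installed vignettes in your web browser
browseVignettes("tableone")
```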

dplyr

Numerical/continuous variables

Alternatively, we can create our own table one. This gives us more flexibility in which summary statistics we wish to present. To do this, we use the standard toolkit: dplyr and tidyr.

table_num <- data_comb %>% 
  
  # Remove the change; keep baseline and end-of-study only
  filter(time != "delta") %>% 
  
  # Select the numerical clinical variables
  select(time, group, variables$clinic_num) %>% 
  
  # Group by timepoint and group
  group_by(time, group) %>% 
  
  # Initiate call to summarise
  summarise_all(
    
    # Use the funs() selector to tell summarise which summary stats you want
    # (note: funs() is soft-deprecated in newer dplyr versions in favour of list())
    .funs = funs(
      
      # First add number of non-NAs; in other words, n for each variable
      n = sum(!is.na(.)), 
      
      # Add a measure of that variable's skewness
      skewness = e1071::skewness, 
      
      # Then add a few other specific functions
      mean, sd, median, IQR), 
    
    # Remove NAs for all these functions
    na.rm = TRUE
  
    ) %>% 
  
  # Since the output is useless in the current format, 
  # we start to clean up by pivoting
  pivot_longer(cols = -one_of("time", "group"), names_to = "variable", values_to = "values") %>% 
  
  # Then split "variable" by the separator "_"
  separate(col = variable, into = c("variable", "stat"), sep = "_") %>% 
  
  # And create a new column "key" by binding together "time", "group" and "stat" by that same separator "_"
  unite(col = "key", time, group, stat) %>% 
  
  # Finally pivot wider again, using the newly formed "key" column
  pivot_wider(names_from = "key", values_from = "values") %>% 
  
  # We might also want to clean this up a bit, into something more readable
  
  # Start by rounding off numbers
  mutate_at(
    .vars = vars(matches("skewness|mean|sd|median|IQR")), 
    .funs = round, 
    digits = 2
  ) %>% 
  
  # Then create text strings of both the measure of centrality and variance
  mutate(
    
    # For baseline
    `base_control, mean (SD)` = paste0(base_control_mean, " (", base_control_sd, ")"), 
    `base_control, median (IQR)` = paste0(base_control_median, " (", base_control_IQR, ")"), 
    
    `base_intervention, mean (SD)` = paste0(base_intervention_mean, " (", base_intervention_sd, ")"), 
    `base_intervention, median (IQR)` = paste0(base_intervention_median, " (", base_intervention_IQR, ")"), 
    
    # For end of study
    `end_control, mean (SD)` = paste0(end_control_mean, " (", end_control_sd, ")"), 
    `end_control, median (IQR)` = paste0(end_control_median, " (", end_control_IQR, ")"), 
    
    `end_intervention, mean (SD)` = paste0(end_intervention_mean, " (", end_intervention_sd, ")"), 
    `end_intervention, median (IQR)` = paste0(end_intervention_median, " (", end_intervention_IQR, ")")
    
    ) %>% 
  
  select(variable, matches("_n|\\("))

table_num

And that’s it, for the numerical variables. Now off to the categorical variables, or as R knows them: the factors.

Factor/categorical variables
table_fac <- data_comb %>% 
  
  # Remove the change; keep baseline and end-of-study only
  filter(time != "delta") %>% 
  
  # Select the factor/categorical clinical variables
  select(time, group, variables$clinic_fac) %>% 
  
  # Pivot the variables to longer format
  pivot_longer(cols = -one_of("time", "group"), names_to = "variable", values_to = "value") %>% 
  
  # Group by time, group and variable
  group_by(time, group, variable) %>% 
  
  # Count occurrences within each grouping
  count(value) %>% 
  
  # Add percentage within each grouping
  mutate(perc = round((n / sum(n)) * 100, digits = 1)) %>% 
  
  # Pivot those values longer too
  pivot_longer(cols = n:perc, names_to = "names", values_to = "value2") %>% 
  
  # Unite the grouping keys
  unite(col = "key", time, group, names) %>% 
  
  # Pivot wider to a format that's easy to read
  pivot_wider(names_from = "key", values_from = "value2") %>% 
  
  # Add the final summary column for each timepoint and group
  mutate(
    
    # For baseline
    `base_control, n (%)` = paste0(base_control_n, " (", base_control_perc, ")"), 
    `base_intervention, n (%)` = paste0(base_intervention_n, " (", base_intervention_perc, ")"), 
    
    # For end of study
    `end_control, n (%)` = paste0(end_control_n, " (", end_control_perc, ")"), 
    `end_intervention, n (%)` = paste0(end_intervention_n, " (", end_intervention_perc, ")")
    
  ) %>% 
  
  # Select only the final summaries
  select(variable, value, matches("%"))

table_fac

It’s a bit long, but it works fine.

Tests

We could add a simple test at the end, just to see if there are any differences between the groups at these two timepoints. Let’s do that with a t-test.

Numerical/continuous variables

This is how we do it for a single variable at baseline.

data_comb %>% 
  filter(time == "base") %>% 
  t.test(sbp ~ group, data = .) %>% 
  broom::tidy() %>% 
  pull("p.value")
## [1] 0.7972039

Now let’s scale up and do it for many variables at both timepoints.

table_num_p <- data_comb %>% 
  filter(time != "delta") %>% 
  select(time, group, variables$clinic_num) %>% 
  pivot_longer(cols = variables$clinic_num, names_to = "variable") %>% 
  group_by(time, variable) %>%
  nest() %>% 
  mutate(
    p.value = map_dbl(data, 
                      ~t.test(value ~ group, data = .x) %>% 
                        broom::tidy() %>% 
                        pull("p.value"))
    ) %>% 
  select(-data) %>% 
  pivot_wider(names_from = "time", values_from = "p.value") %>% 
  select(variable, base_p.value = base, end_p.value = end)

table_num_p
Factor/categorical variables

For the factors, we use the Chi-Squared test. This is how we do it for one variable.

x_gender <- data_comb %>% filter(time == "base") %>% pull("gender")
y_group <- data_comb %>% filter(time == "base") %>% pull("group")

chisq.test(x_gender, y_group) %>% 
  broom::tidy() %>% 
  pull("p.value")
## [1] 0.9884747

And this is how we scale up.

table_fac_p <- data_comb %>% 
  filter(time != "delta") %>% 
  select(time, group, variables$clinic_fac) %>% 
  pivot_longer(cols = variables$clinic_fac, names_to = "variable") %>% 
  group_by(time, variable) %>%
  nest() %>% 
  mutate(
    p.value = map_dbl(data, 
                      ~chisq.test(x = .x$value, y = .x$group) %>% 
                        broom::tidy() %>% 
                        pull("p.value"))
    ) %>% 
  select(-data) %>% 
  pivot_wider(names_from = "time", values_from = "p.value") %>% 
  select(variable, base_p.value = base, end_p.value = end)

table_fac_p

Export summary tables to Excel

And finally, all these data can be saved to an Excel file so that you can prep it according to journal standards.

list(
  "numerical" = table_num, 
  "numerical_p" = table_num_p, 
  "factor" = table_fac, 
  "factor_p" = table_fac_p
) %>% 
  openxlsx::write.xlsx(file = "../results/tables/table-clinic.xlsx", colWidths = "auto")

Visualizations

Univariate

Numerical variables

Let’s look at BMI using a histogram.

data_comb %>% 
  ggplot(aes(bmi, fill = group)) + 
  geom_histogram() + 
  scale_fill_brewer(palette = "Set1") + 
  facet_wrap(~time, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Is this better displayed with a density plot?

data_comb %>% 
  ggplot(aes(bmi, color = group)) + 
  geom_density() + 
  scale_color_brewer(palette = "Set1") + 
  facet_wrap(~time, scales = "free")

Perhaps - let’s do a boxplot instead.

data_comb %>% 
  ggplot(aes(group, bmi, color = group)) + 
  geom_boxplot() + 
  scale_color_brewer(palette = "Set1") + 
  facet_wrap(~time, scales = "free")

I like the boxplots. But some information is redundant now. Remove the colors from the boxes, and remove the legend. Also, add jittered points on top. Those we can color.

data_comb %>% 
  ggplot(aes(group, bmi)) + 
  geom_boxplot(outlier.alpha = 0) + 
  geom_point(aes(color = group), position = position_jitter(), show.legend = FALSE) + 
  scale_color_brewer(palette = "Set1") + 
  facet_wrap(~time, scales = "free")

Clean up the theme a bit.

data_comb %>% 
  mutate(time = case_when(
    time == "base" ~ "Baseline", 
    time == "end" ~ "End of study", 
    time == "delta" ~ "Change") %>% 
      fct_relevel("Baseline", "End of study")) %>% 
  ggplot(aes(group, bmi)) + 
  geom_boxplot(outlier.alpha = 0) + 
  geom_point(aes(color = group), position = position_jitter(), show.legend = FALSE) + 
  scale_color_brewer(palette = "Set1") + 
  facet_wrap(~time, scales = "free") + 
  theme_classic() + 
  theme(strip.background = element_blank()) + 
  labs(x = "Group", y = "BMI (kg/m2)")

Factor variables
data_comb %>% 
  filter(time != "delta") %>% 
  select(time, group, variables$clinic_fac) %>% 
  pivot_longer(cols = -one_of("time", "group")) %>% 
  group_by(time, group, name) %>% 
  count(value) %>% 
  mutate(perc = n/sum(n) * 100) %>% 
  ggplot(aes(value, perc, fill = group)) + 
  geom_col(position = position_dodge()) + 
  scale_fill_brewer(palette = "Set1") + 
  coord_flip() + 
  facet_wrap(~name, scales = "free") + 
  labs(x = NULL, y = "Percentage (%)")

Change from baseline to end-of-study
summary_ldlc <- data_comb %>% 
  filter(time %in% c("base", "end")) %>% 
  group_by(time, group) %>% 
  summarize(y = mean_cl_normal(ldlc)[[1]], 
            ymin = mean_cl_normal(ldlc)[[2]], 
            ymax = mean_cl_normal(ldlc)[[3]])
data_comb %>% 
  filter(time %in% c("base", "end")) %>% 
  ggplot(aes(time, ldlc)) + 
  geom_line(aes(group = id), color = "grey") + 
  
  # I can either add the manually calculated summary stats
  geom_line(data = summary_ldlc, aes(y = y, color = group, group = group)) +
  geom_pointrange(data = summary_ldlc, aes(y = y, ymin = ymin, ymax = ymax, color = group)) +
  
  # Or use a pair of simple statistical summary functions
  stat_summary(aes(color = group, group = group), fun.y = "mean", geom = "line") + 
  stat_summary(aes(color = group, group = group), fun.data = "mean_cl_normal") + 
  
  scale_color_brewer(name = NULL, palette = "Set1") +
  facet_wrap(~group) + 
  theme_classic() + 
  theme(strip.background = element_blank()) + 
  labs(x = NULL, y = "LDL-C (mmol/L)")

Or with a simple waterfall chart.

data_comb %>% 
  filter(time == "delta") %>% 
  mutate(id.row = row_number() %>% factor()) %>%  
  ggplot(aes(fct_reorder(id.row, ldlc), ldlc)) + 
  geom_col(aes(fill = group)) + 
  scale_fill_brewer(name = NULL, palette = "Set1") +
  theme_classic() + 
  theme(axis.text.x = element_blank(), 
        axis.ticks.x = element_blank()) + 
  labs(x = NULL, y = "LDL-C (mmol/L)")

data_comb %>% 
  filter(time %in% c("base", "end")) %>% 
  select(id, time, group, height:vitd3) %>% 
  pivot_longer(cols = -c(id:group), names_to = "variables", values_to = "value") %>% 
  left_join(annotation$clinic, by = c("variables" = "name.short")) %>% 
  mutate(name.unit = paste0(name.full, " (", unit, ")")) %>% 
  ggplot(aes(time, value)) + 
  geom_line(aes(group = id), color = "grey", size = 0.2) + 
  
  stat_summary(aes(color = group, group = group), fun.y = "mean", geom = "line", 
               position = position_dodge(width = 0.1)) + 
  stat_summary(aes(color = group, group = group), fun.data = "mean_cl_normal", 
               position = position_dodge(width = 0.1)) + 
  
  scale_color_brewer(name = NULL, palette = "Set1") + 
  facet_wrap(~ name.unit, ncol = 4, scales = "free") + 
  theme_classic() + 
  theme(strip.background = element_blank()) + 
  labs(x = NULL, y = NULL)

data_comb %>% 
  filter(time == "delta") %>% 
  select(id, time, group, height:vitd3) %>% 
  pivot_longer(cols = -c(id:group), names_to = "variables", values_to = "value") %>% 
  filter(variables != "height") %>% 
  left_join(annotation$clinic, by = c("variables" = "name.short")) %>% 
  arrange(desc(value)) %>% 
  mutate(name.unit = paste0(name.full, " (", unit, ")"), 
         id.row = row_number() %>% factor()) %>% 
  ggplot(aes(fct_reorder(id.row, value), value, fill = group)) + 
  geom_col() + 
  scale_fill_brewer(name = NULL, palette = "Set1") +
  facet_wrap(~name.unit, ncol = 4, scales = "free") +
  theme_classic() + 
  theme(strip.background = element_blank(), 
        axis.text.x = element_blank(), 
        axis.ticks.x = element_blank()) + 
  labs(x = NULL, y = NULL)

Bivariate

Correlations/heatmap
data_comb %>% 
  filter(time == "base") %>% 
  select(variables$clinic_num) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  as.data.frame() %>% 
  rownames_to_column(var = ".rownames") %>% 
  pivot_longer(-.rownames, names_to = ".colnames", values_to = "r") %>% 
  mutate(r = if_else(r == 1, true = NA_real_, false = r)) %>% 
  left_join(annotation$clinic, by = c(".rownames" = "name.short")) %>% 
  left_join(annotation$clinic, by = c(".colnames" = "name.short")) %>% 
  ggplot(aes(
    x = fct_reorder(name.full.x, order.variable.x), 
    y = fct_reorder(name.full.y, -order.variable.y))) + 
  geom_tile(aes(fill = r), color = "grey90") + 
  scale_fill_distiller(name = "Correlation \ncoefficient", palette = "RdBu", na.value = "grey70") + 
  facet_grid(cols = vars(fct_reorder(facet.group.x, facet.order.x)), 
             rows = vars(fct_reorder(facet.group.y, facet.order.y)),  
             scales = "free", space = "free", 
             labeller = label_wrap_gen(width = 10)) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        strip.text.y = element_text(angle = 0)) + 
  labs(x = NULL, y = NULL)

Time-based

Let’s plot the variation in vitamin D over the course of the year. We expect a cyclic variation between summer and winter.

data_comb %>% 
  select(id, time, date, vitd) %>% 
  filter(time %in% c("base", "end")) %>% 
  ggplot(aes(date, vitd)) + 
  geom_line(aes(group = id), color = "grey80") + 
  geom_point(color = "grey60") + 
  geom_smooth(color = "black") + 
  scale_color_brewer(palette = "Set1") + 
  scale_x_date(date_labels = "%b %Y") + 
  theme_classic() + 
  labs(x = NULL, y = "Vitamin D (nmol/L)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And indeed there is! What about all the other variables?

data_comb %>% 
  select(id, time, date, variables$clinic_num) %>% 
  filter(time %in% c("base", "end")) %>% 
  pivot_longer(cols = variables$clinic_num) %>% 
  left_join(annotation$clinic, by = c("name" = "name.short")) %>% 
  ggplot(aes(date, value)) + 
  geom_line(aes(group = id), color = "grey80") + 
  geom_point(color = "grey60") + 
  geom_smooth(color = "black") + 
  scale_color_brewer(palette = "Set1") + 
  scale_x_date(date_labels = "%b %Y") + 
  facet_wrap(~ name.full, ncol = 4, scales = "free") + 
  theme_classic() + 
  theme(strip.background = element_blank()) + 
  labs(x = NULL, y = NULL)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Multivariate

PCA

Run PCA for clinical variables at baseline.

pca_clinic <- data_comb %>% 
  filter(time == "base") %>% 
  select(id, variables$clinic) %>% 
  as.data.frame() %>% 
  column_to_rownames(var = "id") %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  prcomp(center = TRUE, scale = TRUE)

# From broom website: '"u", "samples", "scores", or "x": returns information about the map from the original space into principal components space.'
tidy(pca_clinic, matrix = "samples")
# From broom website: '"v", "rotation", "loadings" or "variables": returns information about the map from principal components space back into the original space.'
tidy(pca_clinic, matrix = "variables")
# From broom website: '"d" or "pcs": returns information about the eigenvalues.'
tidy(pca_clinic, matrix = "pcs")

See the broom website for more information.

Let’s plot a few of the standard PCA plots:

  • Scree plot
  • Scores plot (“samples”)
  • Loadings plot (“variables”)
tidy(pca_clinic, matrix = "pcs") %>% 
  pivot_longer(cols = c("std.dev", "percent", "cumulative")) %>% 
  ggplot(aes(PC, value)) + 
  geom_line() + 
  facet_wrap(~ name, scales = "free") + 
  theme_minimal()

tidy(pca_clinic, matrix = "samples") %>% 
  mutate(PC = paste0("PC", PC), 
         row = as.character(row) %>% as.numeric()) %>% 
  pivot_wider(names_from = "PC", values_from = "value") %>% 
  left_join(select(data_comb, id, group), by = c("row" = "id")) %>% 
  ggplot(aes(PC1, PC2, color = group)) + 
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + 
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + 
  geom_point() + 
  stat_ellipse(linetype = "dashed") + 
  scale_color_brewer(name = NULL, palette = "Set1") + 
  theme_classic()

tidy(pca_clinic, matrix = "variables") %>% 
  mutate(PC = paste0("PC", PC)) %>% 
  pivot_wider(names_from = "PC", values_from = "value") %>% 
  left_join(annotation$clinic, by = c("column" = "name.short")) %>% 
  ggplot(aes(PC1, PC2)) + 
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + 
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + 
  geom_text(aes(label = column, color = facet.group)) + 
  scale_color_brewer(name = NULL, palette = "Dark2") + 
  theme_classic()

Clustering

Clustering figures could be added here.
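One possible starting point is hierarchical clustering of the baseline participants on the scaled numerical variables. This is only a sketch, assuming the same data_comb and variables objects loaded above; the choice of Ward linkage is arbitrary.

```r
clinic_scaled <- data_comb %>% 
  filter(time == "base") %>% 
  select(id, variables$clinic_num) %>% 
  drop_na() %>% 
  as.data.frame() %>% 
  column_to_rownames(var = "id") %>% 
  # Scale each variable to mean 0, SD 1 so no variable dominates the distances
  scale()

# Hierarchical clustering on Euclidean distances between participants
hc_clinic <- hclust(dist(clinic_scaled), method = "ward.D2")
plot(hc_clinic, labels = FALSE, main = "Clustering of participants at baseline")

# Or let pheatmap cluster both participants and variables in one figure
pheatmap::pheatmap(clinic_scaled, show_rownames = FALSE)
```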

Nightingale data

Visualizations

Univariate

Bivariate

Correlations/heatmap

An overview heatmap.

data_comb %>% 
  filter(time == "base") %>% 
  select(variables$nightingale) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  pheatmap::pheatmap(show_rownames = FALSE, show_colnames = FALSE)

Or we can ask specific questions:

  1. Do the Nightingale ‘standard’ measurements correlate with clinical biochemistry?
data_comb %>% 
  filter(time == "base") %>% 
  ggplot(aes(.panel_x, .panel_y)) + 
  geom_point() + 
  geom_smooth(se = FALSE) + 
  facet_matrix(cols = vars(tc, tg, ldlc, hdlc, apob, apoa1), 
               rows = vars(`Serum-C`, `Serum-TG`, `LDL-C`, `HDL-C`, ApoB, ApoA1)) + 
  theme_bw() + 
  labs(x = "Clinical measurements", y = "Nightingale measurements")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  2. Are the specific lipid species really just reflective of the Particle concentration or Total lipids?
data_comb %>% 
  filter(time == "base") %>% 
  ggplot(aes(.panel_x, .panel_y)) + 
  geom_point() + 
  geom_smooth(se = FALSE) + 
  facet_matrix(cols = vars(matches("-P$")), 
               rows = vars(matches("-L$"))) + 
  theme_bw() + 
  theme(strip.text.y = element_text(angle = 0))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

data_night_cor <- data_comb %>% 
  filter(time == "base") %>% 
  select(variables$nightingale) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "var1") %>% 
  pivot_longer(cols = -var1, names_to = "var2", values_to = "r")

Let’s look at the distribution of correlation coefficients for all variable combinations.

data_night_cor %>% 
  ggplot(aes(r)) + 
  geom_histogram(bins = 100)

Those that are exactly 1 are, of course, each variable correlated with itself. Let’s split by type of variable.

data_night_cor %>% 
  left_join(annotation$nightingale, by = c("var1" = "name.short")) %>% 
  left_join(annotation$nightingale, by = c("var2" = "name.short"), suffix = c("_var1", "_var2")) %>% 
  filter(unit_var1 %in% c("%", "mmol/l"), unit_var2 %in% c("%", "mmol/l")) %>% 
  ggplot(aes(r)) + 
  geom_histogram() + 
  facet_grid(rows = vars(unit_var1), cols = vars(unit_var2)) + 
  labs(x = "Correlation coefficient", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data_night_cor_particle <- data_night_cor %>% 
  filter(str_detect(var1, "-P$")) %>% 
  left_join(select(annotation$nightingale, name.short, size), by = c("var2" = "name.short")) %>% 
  filter(!is.na(size)) %>% select(-size) %>% 
  mutate(var1 = str_replace_all(var1, "-P", ""))

data_night_cor_lipids <- data_night_cor %>% 
  filter(str_detect(var1, "-L$")) %>% 
  left_join(select(annotation$nightingale, name.short, size), by = c("var2" = "name.short")) %>% 
  filter(!is.na(size)) %>% select(-size) %>% 
  mutate(var1 = str_replace_all(var1, "-L", ""))

data_night_cor_comb <- left_join(
  data_night_cor_particle, 
  data_night_cor_lipids, 
  by = c("var1", "var2"), 
  suffix = c("_particles", "_lipids")
)
data_night_cor_comb %>% 
  ggplot(aes(r_particles, r_lipids)) + 
  geom_point() + 
  theme_classic() + 
  labs(x = "Correlation between 'Particle concentration' and lipid species", 
       y = "Correlation between 'Total lipids' and lipid species")

data_night_cor_species <- data_night_cor %>% 
  left_join(annotation$nightingale, by = c("var1" = "name.short")) %>% 
  left_join(annotation$nightingale, by = c("var2" = "name.short"), suffix = c("_var1", "_var2")) %>% 
  filter(!is.na(size_var1), !is.na(size_var2), 
         str_detect(var1, "-P$|-L$", negate = TRUE), 
         str_detect(var2, "-P$|-L$", negate = TRUE)
         ) %>% 
  mutate(var1 = str_replace_all(var1, "_%", ""), 
         var2 = str_replace_all(var2, "_%", ""), 
         type_var1 = abbreviate(type_var1, minlength = 2L) %>% str_to_upper(), 
         type_var2 = abbreviate(type_var2, minlength = 2L) %>% str_to_upper())
data_night_cor_species %>% 
  ggplot(aes(x = fct_reorder(var1, name.order_var1), 
             y = fct_reorder(var2, -name.order_var2), fill = r)) + 
  geom_tile() + 
  scale_fill_gradient2() + 
  facet_grid(cols = vars(fct_rev(unit_var1), type_var1), 
             rows = vars(fct_rev(unit_var2), type_var2), 
             scales = "free", space = "free") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
        axis.text.y = element_text(size = 8)) +
  labs(x = NULL, y = NULL)

Clearly, lipid species associate strongly with each other for absolute values, but not for relative values (%).

Multivariate

PCA

Run PCA for all Nightingale variables at baseline and end-of-study.

# Baseline
pca_nightingale_base <- data_comb %>% 
  filter(time == "base") %>% 
  select(id, variables$nightingale) %>% 
  as.data.frame() %>% 
  column_to_rownames(var = "id") %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  prcomp(center = TRUE, scale = TRUE)

# End-of-study
pca_nightingale_end <- data_comb %>% 
  filter(time == "end") %>% 
  select(id, variables$nightingale) %>% 
  as.data.frame() %>% 
  column_to_rownames(var = "id") %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  prcomp(center = TRUE, scale = TRUE)

Again, see the broom website for more information.

Let’s plot a few of the standard PCA plots:

  • Scree plot
  • Scores plot (“samples”)

It’s no use plotting the loadings plot (“variables”) here: with so many variables, the result would be drastic overplotting.
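One workaround, sketched below, is to plot only the variables with the strongest loadings on PC1 and PC2; the cutoff of 20 variables is arbitrary.

```r
tidy(pca_nightingale_base, matrix = "variables") %>% 
  mutate(PC = paste0("PC", PC)) %>% 
  filter(PC %in% c("PC1", "PC2")) %>% 
  pivot_wider(names_from = "PC", values_from = "value") %>% 
  # Keep only the 20 variables furthest from the origin in the PC1/PC2 plane
  top_n(n = 20, wt = PC1^2 + PC2^2) %>% 
  ggplot(aes(PC1, PC2)) + 
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + 
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + 
  geom_text(aes(label = column), size = 3) + 
  theme_classic()
```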

tidy(pca_nightingale_base, matrix = "pcs") %>% 
  pivot_longer(cols = c("std.dev", "percent", "cumulative")) %>% 
  ggplot(aes(PC, value)) + 
  geom_line() + 
  facet_wrap(~ name, scales = "free") + 
  theme_minimal()

pca_nightingale_scores_prep <- bind_rows(
  tidy(pca_nightingale_base, matrix = "samples") %>% 
    mutate(PC = paste0("PC", PC), 
           row = as.character(row) %>% as.numeric(), 
           time = "base") %>% 
    filter(PC %in% c("PC1", "PC2")) %>% 
    pivot_wider(names_from = "PC", values_from = "value"), 
  tidy(pca_nightingale_end, matrix = "samples") %>% 
    mutate(PC = paste0("PC", PC), 
           row = as.character(row) %>% as.numeric(), 
           time = "end") %>% 
    filter(PC %in% c("PC1", "PC2")) %>% 
    pivot_wider(names_from = "PC", values_from = "value")) %>% 
  left_join(select(data_comb, id, group, time), by = c("row" = "id", "time"))

pca_nightingale_scores_prep %>% 
  ggplot(aes(PC1, PC2)) + 
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + 
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + 
  geom_point(aes(color = time)) + 
  stat_ellipse(aes(color = time), linetype = "dashed") +
  scale_color_brewer(name = NULL, palette = "Dark2") + 
  facet_wrap(~ group) + 
  theme_classic() + 
  theme(strip.background = element_blank(), 
        legend.position = "top")

Clustering

Here I can add something about clustering techniques.
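As a starting point, k-means on the first two baseline PCA scores could look like the sketch below. The choice of two clusters is arbitrary and should be checked, e.g. with an elbow plot of the within-cluster sum of squares.

```r
# k-means has a random start; fix the seed for reproducibility
set.seed(123)

km_night <- pca_nightingale_base$x[, 1:2] %>% 
  kmeans(centers = 2)

pca_nightingale_base$x[, 1:2] %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "id") %>% 
  mutate(cluster = factor(km_night$cluster)) %>% 
  ggplot(aes(PC1, PC2, color = cluster)) + 
  geom_point() + 
  stat_ellipse(linetype = "dashed") + 
  scale_color_brewer(name = "Cluster", palette = "Set1") + 
  theme_classic()
```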

Conclusions

This concludes the script. Some conclusions and take-homes:

  • ggplot2 is great for EDA!
  • EDA is essential to truly understand your data
  • Spend time playing around with different visualizations
  • Do you have a hypothesis? Follow that through visualizations of raw data and/or analyses results!

Session info

To improve reproducibility, print out the session info for this script.

devtools::session_info()
## - Session info ----------------------------------------------------------
## 
## - Packages --------------------------------------------------------------
##  package      * version  date       lib source        
##  acepack        1.4.1    2016-10-29 [1] CRAN (R 3.6.0)
##  assertthat     0.2.1    2019-03-21 [1] CRAN (R 3.6.0)
##  backports      1.1.4    2019-04-10 [1] CRAN (R 3.6.0)
##  base64enc      0.1-3    2015-07-28 [1] CRAN (R 3.6.0)
##  broom        * 0.5.2    2019-04-07 [1] CRAN (R 3.6.0)
##  callr          3.3.0    2019-07-04 [1] CRAN (R 3.6.1)
##  cellranger     1.1.0    2016-07-27 [1] CRAN (R 3.6.0)
##  checkmate      1.9.4    2019-07-04 [1] CRAN (R 3.6.1)
##  class          7.3-15   2019-01-01 [2] CRAN (R 3.6.0)
##  cli            1.1.0    2019-03-19 [1] CRAN (R 3.6.0)
##  cluster        2.0.8    2019-04-05 [2] CRAN (R 3.6.0)
##  colorspace     1.4-1    2019-03-18 [1] CRAN (R 3.6.0)
##  crayon         1.3.4    2017-09-16 [1] CRAN (R 3.6.0)
##  data.table     1.12.2   2019-04-07 [1] CRAN (R 3.6.0)
##  DBI            1.0.0    2018-05-02 [1] CRAN (R 3.6.0)
##  desc           1.2.0    2018-05-01 [1] CRAN (R 3.6.0)
##  devtools       2.1.0    2019-07-06 [1] CRAN (R 3.6.1)
##  digest         0.6.20   2019-07-04 [1] CRAN (R 3.6.1)
##  dplyr        * 0.8.2    2019-06-29 [1] CRAN (R 3.6.0)
##  e1071          1.7-2    2019-06-05 [1] CRAN (R 3.6.0)
##  ellipsis       0.2.0.1  2019-07-02 [1] CRAN (R 3.6.1)
##  evaluate       0.14     2019-05-28 [1] CRAN (R 3.6.0)
##  fansi          0.4.0    2018-10-05 [1] CRAN (R 3.6.0)
##  farver         1.1.0    2018-11-20 [1] CRAN (R 3.6.1)
##  forcats      * 0.4.0    2019-02-17 [1] CRAN (R 3.6.0)
##  foreign        0.8-71   2018-07-20 [2] CRAN (R 3.6.0)
##  Formula        1.2-3    2018-05-03 [1] CRAN (R 3.6.0)
##  fs             1.3.1    2019-05-06 [1] CRAN (R 3.6.0)
##  generics       0.0.2    2018-11-29 [1] CRAN (R 3.6.0)
##  ggforce      * 0.3.1    2019-08-20 [1] CRAN (R 3.6.1)
##  ggplot2      * 3.2.1    2019-08-10 [1] CRAN (R 3.6.1)
##  ggrepel        0.8.1    2019-05-07 [1] CRAN (R 3.6.1)
##  glue           1.3.1    2019-03-12 [1] CRAN (R 3.6.0)
##  gridExtra    * 2.3      2017-09-09 [1] CRAN (R 3.6.1)
##  gtable         0.3.0    2019-03-25 [1] CRAN (R 3.6.0)
##  haven        * 2.1.1    2019-07-04 [1] CRAN (R 3.6.1)
##  highr          0.8      2019-03-20 [1] CRAN (R 3.6.0)
##  Hmisc          4.2-0    2019-01-26 [1] CRAN (R 3.6.0)
##  hms            0.5.0    2019-07-09 [1] CRAN (R 3.6.0)
##  htmlTable      1.13.1   2019-01-07 [1] CRAN (R 3.6.0)
##  htmltools      0.3.6    2017-04-28 [1] CRAN (R 3.6.0)
##  htmlwidgets    1.3      2018-09-30 [1] CRAN (R 3.6.0)
##  httr           1.4.0    2018-12-11 [1] CRAN (R 3.6.0)
##  inline         0.3.15   2018-05-18 [1] CRAN (R 3.6.1)
##  jsonlite       1.6      2018-12-07 [1] CRAN (R 3.6.0)
##  knitr        * 1.23     2019-05-18 [1] CRAN (R 3.6.0)
##  labeling       0.3      2014-08-23 [1] CRAN (R 3.6.0)
##  labelled       2.2.1    2019-05-26 [1] CRAN (R 3.6.1)
##  lattice        0.20-38  2018-11-04 [2] CRAN (R 3.6.0)
##  latticeExtra   0.6-28   2016-02-09 [1] CRAN (R 3.6.0)
##  lazyeval       0.2.2    2019-03-15 [1] CRAN (R 3.6.0)
##  lifecycle      0.1.0    2019-08-01 [1] CRAN (R 3.6.1)
##  loo            2.1.0    2019-03-13 [1] CRAN (R 3.6.1)
##  lubridate      1.7.4    2018-04-11 [1] CRAN (R 3.6.0)
##  magrittr       1.5      2014-11-22 [1] CRAN (R 3.6.0)
##  MASS           7.3-51.4 2019-03-31 [2] CRAN (R 3.6.0)
##  Matrix         1.2-17   2019-03-22 [2] CRAN (R 3.6.0)
##  matrixStats    0.54.0   2018-07-23 [1] CRAN (R 3.6.1)
##  memoise        1.1.0    2017-04-21 [1] CRAN (R 3.6.0)
##  mitools        2.4      2019-04-26 [1] CRAN (R 3.6.1)
##  modelr         0.1.4    2019-02-18 [1] CRAN (R 3.6.0)
##  munsell        0.5.0    2018-06-12 [1] CRAN (R 3.6.0)
##  nlme           3.1-139  2019-04-09 [2] CRAN (R 3.6.0)
##  nnet           7.3-12   2016-02-02 [2] CRAN (R 3.6.0)
##  openxlsx     * 4.1.2    2019-10-29 [1] CRAN (R 3.6.1)
##  packrat        0.5.0    2018-11-14 [1] CRAN (R 3.6.1)
##  pheatmap       1.0.12   2019-01-04 [1] CRAN (R 3.6.0)
##  pillar         1.4.2    2019-06-29 [1] CRAN (R 3.6.0)
##  pkgbuild       1.0.3    2019-03-20 [1] CRAN (R 3.6.0)
##  pkgconfig      2.0.2    2018-08-16 [1] CRAN (R 3.6.0)
##  pkgload        1.0.2    2018-10-29 [1] CRAN (R 3.6.0)
##  plyr           1.8.4    2016-06-08 [1] CRAN (R 3.6.0)
##  polyclip       1.10-0   2019-03-14 [1] CRAN (R 3.6.0)
##  prettyunits    1.0.2    2015-07-13 [1] CRAN (R 3.6.0)
##  processx       3.4.0    2019-07-03 [1] CRAN (R 3.6.1)
##  ps             1.3.0    2018-12-21 [1] CRAN (R 3.6.0)
##  purrr        * 0.3.2    2019-03-15 [1] CRAN (R 3.6.0)
##  R6             2.4.0    2019-02-14 [1] CRAN (R 3.6.0)
##  RColorBrewer   1.1-2    2014-12-07 [1] CRAN (R 3.6.0)
##  Rcpp           1.0.1    2019-03-17 [1] CRAN (R 3.6.0)
##  readr        * 1.3.1    2018-12-21 [1] CRAN (R 3.6.0)
##  readxl       * 1.3.1    2019-03-13 [1] CRAN (R 3.6.0)
##  remotes        2.1.0    2019-06-24 [1] CRAN (R 3.6.0)
##  reshape2       1.4.3    2017-12-11 [1] CRAN (R 3.6.0)
##  rlang          0.4.0    2019-06-25 [1] CRAN (R 3.6.0)
##  rmarkdown    * 1.13     2019-05-22 [1] CRAN (R 3.6.0)
##  rpart          4.1-15   2019-04-12 [2] CRAN (R 3.6.0)
##  rprojroot      1.3-2    2018-01-03 [1] CRAN (R 3.6.0)
##  rstan          2.19.2   2019-07-09 [1] CRAN (R 3.6.1)
##  rstudioapi     0.10     2019-03-19 [1] CRAN (R 3.6.0)
##  rvest          0.3.4    2019-05-15 [1] CRAN (R 3.6.0)
##  scales         1.0.0    2018-08-09 [1] CRAN (R 3.6.0)
##  sessioninfo    1.1.1    2018-11-05 [1] CRAN (R 3.6.0)
##  StanHeaders    2.19.0   2019-09-07 [1] CRAN (R 3.6.1)
##  stringi        1.4.3    2019-03-12 [1] CRAN (R 3.6.0)
##  stringr      * 1.4.0    2019-02-10 [1] CRAN (R 3.6.0)
##  survey         3.36     2019-04-27 [1] CRAN (R 3.6.1)
##  survival       2.44-1.1 2019-04-01 [2] CRAN (R 3.6.0)
##  tableone       0.10.0   2019-02-17 [1] CRAN (R 3.6.1)
##  testthat       2.1.1    2019-04-23 [1] CRAN (R 3.6.1)
##  tibble       * 2.1.3    2019-06-06 [1] CRAN (R 3.6.0)
##  tidyr        * 1.0.0    2019-09-11 [1] CRAN (R 3.6.1)
##  tidyselect     0.2.5    2018-10-11 [1] CRAN (R 3.6.0)
##  tidyverse    * 1.2.1    2017-11-14 [1] CRAN (R 3.6.1)
##  tinytex        0.14     2019-06-25 [1] CRAN (R 3.6.0)
##  tweenr         1.0.1    2018-12-14 [1] CRAN (R 3.6.1)
##  usethis        1.5.1    2019-07-04 [1] CRAN (R 3.6.1)
##  utf8           1.1.4    2018-05-24 [1] CRAN (R 3.6.0)
##  vctrs          0.2.0    2019-07-05 [1] CRAN (R 3.6.1)
##  withr          2.1.2    2018-03-15 [1] CRAN (R 3.6.0)
##  xfun           0.8      2019-06-25 [1] CRAN (R 3.6.0)
##  xml2           1.2.0    2018-01-24 [1] CRAN (R 3.6.0)
##  yaml           2.2.0    2018-07-25 [1] CRAN (R 3.6.0)
##  zeallot        0.1.0    2018-01-28 [1] CRAN (R 3.6.0)
##  zip            2.0.4    2019-09-01 [1] CRAN (R 3.6.1)
##  zoo            1.8-6    2019-05-28 [1] CRAN (R 3.6.1)
## 
## [1] C:/Users/jacobjc/Documents/R/win-library/3.6
## [2] C:/Program Files/R/R-3.6.0/library